AITopics | pretraining task-agnostic visiolinguistic representation

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems, Dec-25-2025, 23:17:19 GMT

We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
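The interaction between the two streams can be illustrated with a minimal sketch. It assumes one common reading of co-attention, in which each stream's vectors act as queries over the other stream's vectors; the abstract does not spell this out. The function names (`cross_attend`, `co_attention`) and the toy inputs are hypothetical, and learned projections, multiple heads, residual connections, and layer normalization are all omitted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, other):
    # Each query attends over the OTHER stream's vectors, which serve as
    # both keys and values here (assumption: no learned projections).
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in other]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, other)) for i in range(dim)])
    return out

def co_attention(visual, text):
    # Visual queries read from text; text queries read from visual.
    return cross_attend(visual, text), cross_attend(text, visual)

# Toy inputs: three image-region vectors and two token vectors (made up).
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5], [1.0, -1.0]]
new_visual, new_text = co_attention(visual, text)
```

Each output keeps the length of its own stream (three region vectors, two token vectors), but every vector is now a weighted mixture of the other modality's vectors.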

name change, pretraining task-agnostic visiolinguistic representation, vilbert, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.61)
Information Technology > Artificial Intelligence > Natural Language (0.41)


Reviews: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems, Jan-27-2025, 00:29:44 GMT

I think this paper is a solid extension of masked language model pre-training to image-and-text tasks (e.g., captioning). It defines two novel but intuitive pre-training tasks for this setting: (i) predicting the semantic class of masked image regions given the surrounding regions of the same image and the corresponding text, and (ii) predicting whether an image and a text are aligned. The authors demonstrate significant improvements over both the previous SOTA and the strong baseline of simply using a pre-trained text-only BERT model. They also show that two encoders with separate parameters, one for images and one for text, are superior to a joint encoder. I would have liked to see more ablation of the pre-training tasks, which I think would be more interesting than the model depth ablation that the authors performed.
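The two pre-training tasks described in this review can be sketched as a combined loss. This is an illustration only: it assumes hard class labels for masked regions (following the review's wording), a single aligned/not-aligned probability per image-text pair, and an equal weighting of the two terms. None of these details are stated in the review, and the function names are hypothetical.

```python
import math

def cross_entropy(probs, target_index):
    # Negative log-likelihood of the target class under predicted probabilities.
    return -math.log(probs[target_index])

def binary_cross_entropy(p_aligned, is_aligned):
    # Loss for the image-text alignment prediction.
    return -math.log(p_aligned if is_aligned else 1.0 - p_aligned)

def pretraining_loss(region_probs, region_labels, p_aligned, is_aligned):
    # Task (i): classify each masked region; averaged over masked regions.
    region_loss = 0.0
    if region_labels:
        region_loss = sum(
            cross_entropy(p, y) for p, y in zip(region_probs, region_labels)
        ) / len(region_labels)
    # Task (ii): alignment; the terms are summed with equal weight (assumption).
    return region_loss + binary_cross_entropy(p_aligned, is_aligned)

# One masked region with a uniform two-class prediction, and an aligned pair
# predicted as aligned with probability 0.5 (toy values).
loss = pretraining_loss([[0.5, 0.5]], [0], 0.5, True)
```

With these toy values each term equals ln 2, so the total is 2 ln 2.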

ablation, pretraining task-agnostic visiolinguistic representation, vision-and-language task, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.88)
Information Technology > Artificial Intelligence > Natural Language (0.61)



ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Lu, Jiasen, Batra, Dhruv, Parikh, Devi, Lee, Stefan

Neural Information Processing Systems, Mar-18-2020, 20:18:50 GMT

We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability. Papers published at the Neural Information Processing Systems Conference.

pretraining task-agnostic visiolinguistic representation, vilbert, vision-and-language task, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.65)
Information Technology > Artificial Intelligence > Machine Learning (0.51)
Information Technology > Artificial Intelligence > Natural Language (0.45)

